Skip to content

Conversation

@gtauzin
Copy link
Contributor

@gtauzin gtauzin commented May 19, 2025

Description

While putting pipelines in production that rely on the StudyDataset that I wrote in #1021, I noticed that providing it with a password in credentials does not work. The password along with the database name, user name, host and ports are translated into a SQLAlchemy URL that was then casted to a string, resulting in the password becoming "***".

Development notes

  • Remove the casting of the storage URL into a string and keep the SQLAlchemy URL
  • Whenever the actual storage URL is needed, make use of URL.render_as_string
  • Add a test that checks the conversion (and the _study_name_exists method)

In order to try out in a real case, you can make use of the optuna_dist branch of kedro-dagster-example:

git clone --branch optuna_dist [email protected]:gtauzin/kedro-dagster-example.git
cd kedro-dagster-example

Spin the postgresql DB:

docker compose -f docker/pipelines-dev.docker-compose.yml up

In a separate terminal, do:

export POSTGRES_DB=dev_db
export POSTGRES_USER=dev_user                           
export POSTGRES_PASSWORD=dev_password
export POSTGRES_HOST=localhost
export POSTGRES_PORT=5432
export KEDRO_ENV=dev 
uv run kedro run --env KEDRO_ENV

You can change from the local StudyDataset (which is the one I pushed here as well - kedro_dagster_example.study_dataset.StudyDataset) to the one release in kedro-datasets (kedro_datasets_experimental.optuna.StudyDataset) in the conf/dev/catalog.yml to observe the fix.

Checklist

  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • [] Updated the documentation to reflect the code changes
  • [] Updated jsonschema/kedro-catalog-X.XX.json if necessary
  • Added a description of this change in the relevant RELEASE.md file
  • Added tests to cover my changes
  • Received approvals from at least half of the TSC (required for adding a new, non-experimental dataset)

@gtauzin gtauzin force-pushed the fix/optuna_pwd_db branch from ca8b961 to e827ed9 Compare May 19, 2025 19:46
pruner_config = load_args.pop("pruner")
pruner = self._get_pruner(pruner_config)

storage_url_str = self._storage_url.render_as_string(hide_password=False)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't this expose your database password? I have never used Optuna, so don't really know how this works, but ideally this should still be a secure way of handling credentials.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does indeed expose your database password. I am not sure what are the security consequences of this but I am not seeing a workaround.

Optuna's RDBStorage, requires the url of the RDB in the form of a SQLAlchemy URL string and stores it directly as a class attribute. This string includes the target db, its host, port, password and name. Previously with self._storage = str(storage), the password was hidden and translated to "***" within the string which meant authentication to a password-protected RDB was impossible. Now even if I instantiate the RDBStorage class in the dataset class constructor to reuse it in save/load methods, its url class attribute would still contain all the credentials.

Would you have any pointers regarding to the security scenario where that would be problematic or to a discussion regarding secure handling of credentials? I am happy to learn more about this and try and come back with a better solution!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @merelcht
It's been a few months already and I would really like to get this merge as the current StudyDataset does not really allow for distributed hyperparameter tuning. I am happy to work on a solution but I need some pointers as mentioned above. Please let me know :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @gtauzin , sincere apologies for not getting back to you earlier! This slipped my mind during and after the 1.0 release of Kedro.

My main concern is whether this could accidentally leak credentials and make it possible for bad actors to retrieve your database keys. Would it work if you just directly call:

storage = optuna.storages.RDBStorage(url=self._storage_url)

And skip the rendering as string?

Copy link
Member

@deepyaman deepyaman Sep 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm OK with the password handling and don't necessarily see something better. The rendered password is only in a variable, not in anything saved to the class, right? Unless I'm missing something in my pass through.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@deepyaman Indeed, all credentials are stored in a class variable _storage_url which means they can be directly accessed as my_dataset._storage_url.render_as_string(hide_password=False). This is the only place where they appear.

@merelcht No worries! Initially I figured you were busy with 1.0 and then I forgot about it until I started needing it again :)

For your suggestion, the url parameter required by RDBStorage is a string, which is why I use the render_as_string method.

I have had a look at redis.PickleDataset and it seems to me you can get back credentials from a dataset instance as well by doing my_pickle_dataset._redis_db.connection_pool.connection_kwargs.

Is the problem related to credentials are stored in the class and can be accessed clearly or the presence of a line of code which exposes them clearly?

Thanks to both of you for the help.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have had a look at redis.PickleDataset and it seems to me you can get back credentials from a dataset instance as well by doing my_pickle_dataset._redis_db.connection_pool.connection_kwargs.

This is a fair point! I don't have a good alternative at this point either and don't want to be a blocker to get this in. My suggestion is to merge this now as is and come back to it in case it becomes an issue. Thanks for your patience @gtauzin ! 🙏

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@merelcht Thank you :)

Co-authored-by: Deepyaman Datta <[email protected]>
Signed-off-by: Merel Theisen <[email protected]>
@gtauzin
Copy link
Contributor Author

gtauzin commented Oct 16, 2025

It seems like the failing test is unrelated to the changes made in this PR. How should I address this? Thanks.

@deepyaman
Copy link
Member

It seems like the failing test is unrelated to the changes made in this PR. How should I address this? Thanks.

You can ignore, unless you want to debug it for fun. Maybe the government shutdown took down the demo NASA data. 😜

@merelcht
Copy link
Member

It seems like the failing test is unrelated to the changes made in this PR. How should I address this? Thanks.

This was just a random API timeout. All fine now, so I'm merging! 🎉

@merelcht merelcht merged commit 37e81b0 into kedro-org:main Oct 17, 2025
16 of 23 checks passed
@gtauzin gtauzin deleted the fix/optuna_pwd_db branch October 17, 2025 08:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants